AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segments of customers to target.
ID: Customer ID
Age: Customer's age in completed years
Experience: Number of years of professional experience
Income: Annual income of the customer (in thousand dollars)
ZIPCode: Home address ZIP code
Family: Family size of the customer
CCAvg: Average spending on credit cards per month (in thousand dollars)
Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
Mortgage: Value of house mortgage, if any (in thousand dollars)
Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.
On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the code above ensures that all necessary libraries and their dependencies are installed at versions that let this notebook run successfully.
# The data wranglers:
import pandas as pd
import numpy as np
# data visualization libraries:
import matplotlib.pyplot as plt
import seaborn as sns
# to split the data
from sklearn.model_selection import train_test_split
# to build the model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To get the scores
from sklearn.metrics import (
accuracy_score,
recall_score,
precision_score,
f1_score,
confusion_matrix
)
# to suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")
from google.colab import drive
drive.mount('/content/drive')
data = pd.read_csv("/content/drive/MyDrive/Data Science McCombs Class/bank decision tree/Loan_Modelling.csv")
df=data.copy()
df.head()
df.tail()
df.shape
df.info()
there are a lot of numerical variables that we will have to treat as categorical and encode as dummies through one-hot encoding
df.describe()
since our target variable in this classification problem is binary, its summary statistics aren't very informative yet...
on average our customers are 45 years old, make $73k a year, have a family size of 2.39, spend close to $2k a month on credit cards, and have at least an undergraduate education. The MEAN mortgage is only $56k because most customers carry no mortgage at all.
df.isnull().sum()
there are no null values
df = df.drop(['ID'], axis=1)
#grab the unique values for the integer columns, but not the floats, since those have too many distinct values and there are no object/categorical columns yet
sanity_check = df.select_dtypes(include=['int64']).columns
for var in sanity_check:
print(var)
print(df[var].unique())
print()
Experience has some negative values that were probably sign typos, so we flip them to positive.
df["Experience"] = df["Experience"].replace({-1: 1, -2: 2, -3: 3})
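The same cleanup can be sketched on a toy Series (values made up); taking the absolute value handles every negative sign typo at once:

```python
import pandas as pd

# Toy stand-in for the Experience column (values are made up)
exp = pd.Series([5, -1, 12, -3, 0, -2])

# Treat negative entries as sign typos and flip them to positive
cleaned = exp.abs()

print(cleaned.tolist())  # [5, 1, 12, 3, 0, 2]
```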
#make some of the variables categorical
cat_cols = [
"Education",
"Personal_Loan",
"Securities_Account",
"CD_Account",
"Online",
"CreditCard",
"ZIPCode",
"Family"
]
df[cat_cols] = df[cat_cols].astype("category")
df.duplicated().sum()
sns.countplot(df["CreditCard"])
plt.show()
credit_card_ones = df[df['CreditCard'] == 1].shape[0]
print(f"Number of customers with CreditCard = 1: {credit_card_ones}")
sns.countplot(df["Personal_Loan"])
plt.show()
this is our target variable, and as we can see, far fewer people took the loan... our customer base is expanding rapidly, so whatever patterns we see in this historical data should hold for new customers in the same market, as long as we aren't expanding into new markets where cultural norms and frugality differ...
sns.countplot(df["Securities_Account"])
plt.show()
sns.countplot(df["CD_Account"])
plt.show()
sns.countplot(df["Online"])
plt.show()
sns.countplot(df["Education"])
plt.show()
sns.countplot(df["Family"])
plt.show()
sns.countplot(df["ZIPCode"])
plt.show()
df["ZIPCode"].nunique()
customers are spread across many ZIP codes, which is good for geographic diversification: if a storm comes along, we don't have all our eggs in one basket.
#create bins for the zip codes so we can model them more easily.
df["ZIPCodeZone"] = df["ZIPCode"].astype(str)
print(
"Number of unique values if we take first two digits of ZIPCode: ",
df["ZIPCodeZone"].str[0:2].nunique(),
)
df["ZIPCodeZone"] = df["ZIPCodeZone"].str[0:2]
df["ZIPCodeZone"] = df["ZIPCodeZone"].astype("category")
sns.countplot(df["ZIPCodeZone"])
plt.show()
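The zone binning above can be sketched on a handful of hypothetical ZIP codes (made up for illustration): the first two digits collapse thousands of codes into a few coarse regions.

```python
import pandas as pd

# Hypothetical 5-digit ZIP codes (made up for illustration)
zips = pd.Series([94025, 94550, 90210, 92121, 94110])

# The first two digits serve as a coarse geographic "zone"
zones = zips.astype(str).str[:2].astype("category")

print(zones.tolist())   # ['94', '94', '90', '92', '94']
print(zones.nunique())  # 3
```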
sns.histplot(df["Mortgage"], kde=True)
plt.show()
the vast majority have no mortgage at all (a value of zero), which does not necessarily mean they don't own a home
sns.boxplot(df["Mortgage"])
plt.show()
even customers with mortgages on average-priced homes show up as outliers in this distribution
sns.histplot(df["CCAvg"], kde=True)
plt.show()
most people spend under $2k a month, but some spend as much as $10k. it tapers off significantly after $3k.
sns.boxplot(df["CCAvg"])
plt.show()
sns.boxplot(df["Age"])
plt.show()
sns.histplot(df["Age"], kde=True)
plt.show()
the roughly uniform distribution reflects a customer base spread evenly across working ages
sns.boxplot(df["Experience"])
plt.show()
sns.histplot(df["Experience"], kde=True)
plt.show()
similar reasoning applies: experience tracks the working-age lifecycle
sns.boxplot(df["Income"])
plt.show()
half of the customers make between roughly $40k and $100k; anything over about $190k is an outlier
sns.histplot(df["Income"], kde=True)
plt.show()
the distribution is right-skewed: it shoots up to a peak around $50k and flows down gently toward the outliers above $200k
Multivariate
sns.pairplot(df, hue="Personal_Loan")
plt.show()
people who make above $100k are much more likely to take out the loan. average credit card spending over $3k seems to be a good splitting point as well. mortgage may have more impact than age, but less than some other features.
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
income and spending show the only notable correlation among the numeric features
we now need to look at our categorical variables to see how they might impact the target variable.
pd.crosstab(df["Education"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
a small increase in personal loans with more education, but nothing dramatic
pd.crosstab(df["Family"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
nothing too interesting here
pd.crosstab(df["ZIPCode"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
if we had the patience and processing power this would be very interesting, as some zip codes have very high proportions of loans
pd.crosstab(df["ZIPCodeZone"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
when it's broken down by zone there is no real change, but these features could still matter to the model after one or two splits.
pd.crosstab(df["CCAvg"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
the count of loans is roughly constant across spending levels, but as a proportion they do increase with spending
bar_plot = pd.crosstab(df["CCAvg"], df["Personal_Loan"]).plot(kind="bar", figsize=(12, 6))  # increase figure size
# Rotate x-axis labels for better readability
plt.xticks(rotation=90, ha='center')  # rotate by 90 degrees for maximum space
plt.xlabel("CCAvg", fontsize=12)  # add x-axis label with larger font
plt.ylabel("Count", fontsize=12)  # add y-axis label
plt.show()
upon closer inspection, loans do increase as a proportion of each spending level
pd.crosstab(df["Age"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
it's fairly constant except at the tails of the distribution: very young and very old customers don't seem to take our loans
pd.crosstab(df["Experience"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
pd.crosstab(df["Income"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
a very similar story to the spending, but even more pronounced
# Bin 'Income' into ranges
bins = [0, 50, 100, 150, 200, 300] # Adjust ranges based on your data
labels = ["0-50", "50-100", "100-150", "150-200", "200+"]
df["Income_Binned"] = pd.cut(df["Income"], bins=bins, labels=labels, include_lowest=True)
# Plot the crosstab with binned income
pd.crosstab(df["Income_Binned"], df["Personal_Loan"]).plot(kind="bar", figsize=(12, 6))
# Adjust plot settings
plt.xticks(rotation=45, ha='right')
plt.xlabel("Income Range")
plt.ylabel("Count")
plt.title("Income Range vs Personal Loan")
plt.tight_layout()
plt.show()
customers making $150-200k accepted the loan at roughly a 50% rate
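Rather than eyeballing the bars, `pd.crosstab` with `normalize="index"` turns counts into within-bin conversion rates; a sketch on made-up data:

```python
import pandas as pd

# Toy data: income in $k and loan uptake (values are made up)
toy = pd.DataFrame({
    "Income": [30, 45, 80, 95, 120, 160, 180, 210],
    "Personal_Loan": [0, 0, 0, 1, 1, 1, 1, 1],
})

bins = [0, 50, 100, 150, 200, 300]
labels = ["0-50", "50-100", "100-150", "150-200", "200+"]
toy["Income_Binned"] = pd.cut(toy["Income"], bins=bins, labels=labels, include_lowest=True)

# normalize="index" converts counts to within-bin proportions,
# i.e. the conversion rate for each income range
rates = pd.crosstab(toy["Income_Binned"], toy["Personal_Loan"], normalize="index")
print(rates)
```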
pd.crosstab(df["CD_Account"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
a similar jump in conversion appears for customers holding a CD account with us
pd.crosstab(df["Securities_Account"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
pd.crosstab(df["Online"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
pd.crosstab(df["CreditCard"], df["Personal_Loan"]).plot(kind="bar")
plt.show()
sns.pairplot(df, hue="Personal_Loan")
plt.show()
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
The professor said in the lectures not to worry about outliers, and that they can even be helpful in a classification problem, so I'm leaving them in; it looks like a lot of people in the discussion did the same.
#check the data types and bins you created so you can play with them
#we dont need this
df.drop("Income_Binned", axis=1, inplace=True)
# dropping Experience as it is almost perfectly correlated with Age, ZIPCode because we model the coarser zones instead, and Personal_Loan because it's our y
X = df.drop(["Personal_Loan", "ZIPCode", "Experience"], axis=1)
Y = df["Personal_Loan"]
#creating the dummies
X = pd.get_dummies(X, columns=["ZIPCodeZone", "Family", "Education"], drop_first=True)
X = X.astype(float)
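As a quick illustration of what `drop_first=True` does (toy frame, made-up values): one level per variable is dropped as the implicit baseline, avoiding redundant columns.

```python
import pandas as pd

# Toy categorical column mirroring the Education coding
toy = pd.DataFrame({"Education": [1, 2, 3, 1]}).astype("category")

# drop_first=True drops one level as the baseline to avoid redundancy
dummies = pd.get_dummies(toy, columns=["Education"], drop_first=True)

print(dummies.columns.tolist())  # ['Education_2', 'Education_3']
```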
# Splitting data in train and test sets with a quarter in the test
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.25, random_state=1
)
#make sure everything looks like you expect it to here before you build anything
X.head()
#we need to make sure the test and train sets have a similar proportion of our target variable
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
this looks good; the class proportions are similar, so we can use this split
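The split happened to preserve the class balance here, but with a roughly 9% positive class it is safer to guarantee it with the `stratify` argument; a sketch on synthetic labels (made up, not the bank data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 10% positives (made up)
y = np.array([1] * 10 + [0] * 90)
X = np.arange(100).reshape(-1, 1)

# stratify=y forces both splits to keep (almost exactly) the same class mix
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

print(y_tr.mean(), y_te.mean())  # both close to 0.10
```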
# defining a function to compute the performance metrics
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred)
recall = recall_score(target, pred)
precision = precision_score(target, pred)
f1 = f1_score(target, pred)
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
#predict y using the model
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
#labels for matrix
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
#create instance of model
dtree1 = DecisionTreeClassifier(criterion="gini", random_state=1)
#fit the model to the training data
dtree1.fit(X_train, y_train)
confusion_matrix_sklearn(dtree1, X_train, y_train)
#grab the metrics store them for later
dtree1_train_perf = model_performance_classification_sklearn(
dtree1, X_train, y_train
)
dtree1_train_perf
feature_names = list(X_train.columns)
print(feature_names)
plt.figure(figsize=(20, 30))
#plot the tree
out = tree.plot_tree(
dtree1,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
#add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
the left side of the tree gets fairly developed early on, while the right side takes more splits to reach the impurity needed for leaf nodes in this initial tree. many leaves contain only one sample, and this tree is too complicated to read
print(tree.export_text(dtree1, feature_names=feature_names, show_weights=True))
#compute the Gini importances for this overly complicated tree; many of these features won't even show up in our pruned trees
print(
pd.DataFrame(
dtree1.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
importances = dtree1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
this model is practically useless apart from the feature importances. surprisingly, average spending has less impact than education and family size. income is the most useful, age is somewhat important, and everything else will likely not appear at all in a pruned tree.
confusion_matrix_sklearn(dtree1,X_test,y_test)
dtree1_test_perf = model_performance_classification_sklearn(dtree1,X_test,y_test)
dtree1_test_perf
The idea is to get a simpler model we can use to bin customers for segmentation rather than more-or-less personalized marketing. This tree is too complex for our marketing budget. It doesn't perform badly on the test data, but it's not practical to use for advertising.
I decided to build a model that tries to maximize recall, because I want to see if we can find all the people who took out the loan the last time around. We don't want to ignore false positives, since those waste resources, but it's still a good idea to find all the actual positives so we can understand who NEEDS to see an ad.
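The trade-off can be seen on a toy example (made-up labels): a model that flags generously catches every buyer (perfect recall) at the cost of some wasted ads (lower precision).

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels: 4 actual loan-takers; the model flags 6 customers
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

# Recall = TP / (TP + FN): did we find every actual loan-taker?
print(recall_score(y_true, y_pred))               # 1.0
# Precision = TP / (TP + FP): how many flagged customers actually buy?
print(round(precision_score(y_true, y_pred), 3))  # 0.667
```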
# Define the parameters
max_depth_values = np.arange(2, 11, 2)
max_leaf_nodes_values = [10, 15, 25, 50, 75, 150, 250]
min_samples_split_values = [10, 20, 30, 50, 70]
# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0
# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
for max_leaf_nodes in max_leaf_nodes_values:
for min_samples_split in min_samples_split_values:
# Initialize the tree with the current set of parameters
estimator = DecisionTreeClassifier(
max_depth=max_depth,
max_leaf_nodes=max_leaf_nodes,
min_samples_split=min_samples_split,
class_weight='balanced',  # address class imbalance
random_state=1
)
# Fit the model to the training data
estimator.fit(X_train, y_train)
# Make predictions on the training and test sets
y_train_pred = estimator.predict(X_train)
y_test_pred = estimator.predict(X_test)
# Calculate recall scores
train_recall_score = recall_score(y_train, y_train_pred)
test_recall_score = recall_score(y_test, y_test_pred)
# Calculate the absolute difference between training and test recall scores
score_diff = abs(train_recall_score - test_recall_score)
# Update the best estimator and best score if the current one has a smaller score difference
if (score_diff < best_score_diff) and (test_recall_score > best_test_score):
best_score_diff = score_diff
best_test_score = test_recall_score
best_estimator = estimator
# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
After iterating over an initial set of parameters, with the goal of finding the model with the smallest recall gap between train and test (to satisfy the business requirement with a simpler tree), the best parameters were: max depth 2, max leaf nodes 10, min samples split 10, with a test recall of 1.0. We still need to look at the other performance metrics to see if this model is usable...
# Fit the best algorithm to the data.
dtree2 = best_estimator
dtree2.fit(X_train, y_train)
confusion_matrix_sklearn(dtree2, X_train, y_train)
dtree2_train_perf = model_performance_classification_sklearn(dtree2, X_train, y_train)
dtree2_train_perf
confusion_matrix_sklearn(dtree2, X_test, y_test)
dtree2_test_perf = model_performance_classification_sklearn(dtree2, X_test, y_test)
our precision is terrible... but this tree gives us a simple picture of which groups the actual positives fall into.
feature_names = list(X_train.columns)
plt.figure(figsize=(20, 20))
# plotting the decision tree
out = tree.plot_tree(
dtree2, # decision tree classifier model
feature_names=feature_names, # list of feature names (columns) in the dataset
filled=True, # fill the nodes with colors based on class
fontsize=9,
node_ids=False,
class_names=None,
)
# add arrows to the decision tree splits if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
the first split is on income (around $90k): on one side the tree looks at credit card spending, on the other at education level. using these features gives us perfect recall; we found every instance where someone purchased a loan, but a large proportion of the predicted positives are false positives. this model would waste marketing resources and create potential risk-management issues.
it is still interesting, though, because it hints at which features we should be looking at.
# printing a text report showing the rules of a decision tree
print(
tree.export_text(
dtree2,
feature_names=feature_names,
show_weights=True
)
)
the third tree will be a second pre-pruned tree that iterates over the same parameter grid and picks the model with the smallest difference in f1 scores between the test and training sets
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 11, 2)
max_leaf_nodes_values = [10, 15, 25, 50, 75, 150, 250]
min_samples_split_values = [10, 20, 30, 50, 70]
# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0
# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
for max_leaf_nodes in max_leaf_nodes_values:
for min_samples_split in min_samples_split_values:
# Initialize the tree with the current set of parameters
estimator = DecisionTreeClassifier(
max_depth=max_depth,
max_leaf_nodes=max_leaf_nodes,
min_samples_split=min_samples_split,
class_weight='balanced',
random_state=1
)
# Fit the model to the training data
estimator.fit(X_train, y_train)
# Make predictions on the training and test sets
y_train_pred = estimator.predict(X_train)
y_test_pred = estimator.predict(X_test)
# Calculate f1 scores for training and test sets
train_f1_score = f1_score(y_train, y_train_pred)
test_f1_score = f1_score(y_test, y_test_pred)
# Calculate the absolute difference between training and test f1 scores
score_diff = abs(train_f1_score - test_f1_score)
# Update the best estimator and best score if the current one has a smaller score difference
if (score_diff < best_score_diff) and (test_f1_score > best_test_score):
best_score_diff = score_diff
best_test_score = test_f1_score
best_estimator = estimator
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test f1 score: {best_test_score}")
here the parameters are wildly different from our first pre-pruned tree, so we will keep the hyperparameter grid broad. since the f1 score with the smallest train/test gap is relatively low, we will instead look for the model with the highest f1 on the test data.
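The hand-rolled triple loop above could also be expressed with scikit-learn's `GridSearchCV`, which scores by cross-validation instead of against a single test set; a sketch on synthetic data (not the bank dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the bank's customers
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=1
)

param_grid = {
    "max_depth": [2, 4, 6],
    "max_leaf_nodes": [15, 25, 50],
    "min_samples_split": [10, 20, 30],
}

grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=1),
    param_grid,
    scoring="f1",  # cross-validated F1 rather than a single train/test gap
    cv=5,
)
grid.fit(X, y)

print(grid.best_params_)
```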
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 7, 2)
max_leaf_nodes_values = [15, 25, 50, 75, 150, 250]
min_samples_split_values = [8, 10, 20, 30, 50, 70]
# Initialize variables to store the best model and its performance
best_estimator = None
best_test_f1_score = 0.0
# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
for max_leaf_nodes in max_leaf_nodes_values:
for min_samples_split in min_samples_split_values:
# Initialize the tree with the current set of parameters
estimator = DecisionTreeClassifier(
max_depth=max_depth,
max_leaf_nodes=max_leaf_nodes,
min_samples_split=min_samples_split,
class_weight='balanced',
random_state=1
)
# Fit the model to the training data
estimator.fit(X_train, y_train)
# Make predictions on the test set
y_test_pred = estimator.predict(X_test)
# Calculate the F1 score for the test set
test_f1_score = f1_score(y_test, y_test_pred)
# Update the best estimator if the current one has a higher test F1 score
if test_f1_score > best_test_f1_score:
best_test_f1_score = test_f1_score
best_estimator = estimator
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test F1 score: {best_test_f1_score}")
# Fit the best algorithm to the data.
dtree3 = best_estimator
dtree3.fit(X_train, y_train)
dtree3_train_perf = model_performance_classification_sklearn(dtree3,X_train,y_train)
dtree3_train_perf
confusion_matrix_sklearn(dtree3, X_train, y_train)
confusion_matrix_sklearn(dtree3, X_test, y_test)
dtree3_test_perf = model_performance_classification_sklearn(dtree3,X_test,y_test)
dtree3_test_perf
the precision looks a lot better... we aren't overestimating our marketing capacity! we are still finding all of the people who took the loans, and the improved precision means we waste less money advertising to people who will never buy the product. now close to 80% of our predicted positives actually convert!
feature_names = list(X_train.columns)
plt.figure(figsize=(20, 20))
# plotting the decision tree
out = tree.plot_tree(
dtree3, # decision tree classifier model
feature_names=feature_names, # list of feature names (columns) in the dataset
filled=True, # fill the nodes with colors based on class
fontsize=9,
node_ids=False,
class_names=None,
)
# add arrows to the decision tree splits if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
Personally, I like these results better than the previous ones. HOWEVER, this tree is too complicated for me to develop marketing strategies from. we have limited time and patience, so we will build one last tree in the hope of simplifying the process further while keeping the purity of the tree intact.
there are lots of ways to tune these hyperparameters depending on the business requirements and the data you are working with. i would ask clarifying questions to get a better idea of which would be best, but ultimately the post-pruned tree will probably be the preferred one anyway.
our marketing department may choose this model, but i won't get full marks unless i grow one last tree.
Post Pruning
#create an instance of a model
clf = DecisionTreeClassifier(random_state=1)
#compute the cost complexity pruning path for the model on the training data
path = clf.cost_complexity_pruning_path(X_train, y_train)
#grab all the effective alphas from the pruning path
ccp_alphas = abs(path.ccp_alphas)
#find the impurities corresponding to each alpha along the pruning path
impurities = path.impurities
pd.DataFrame(path)
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set");
Once alpha drops below about 0.0025, there isn't much further change in the total impurity
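As a sanity check on how the pruning path behaves, a minimal sketch on synthetic data (not the bank dataset): larger `ccp_alpha` values prune harder, shrinking the tree all the way down to a single root node at the last effective alpha.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the bank dataset)
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)

# Refit at each effective alpha: node counts shrink as alpha grows
nodes = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X, y).tree_.node_count
    for a in path.ccp_alphas
]

print(nodes[0], nodes[-1])  # full tree size ... 1 (root only)
```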
#initialize an empty list
clfs = []
#iterate over each ccp alpha on the pruning path
for ccp_alpha in ccp_alphas:
#create an instance of a decision tree
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
clf.fit(X_train, y_train)
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
#remove the trivial single-node tree for bookkeeping
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
#get the number of nodes for each
node_counts = [clf.tree_.node_count for clf in clfs]
#extract the max depth for each
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
#to avoid overlap
fig.tight_layout()
when alpha is below about 0.0025 the tree is too complex...
#create an empty list for our recalls
recall_train = []
#iterate through each of the decision tree classifiers in clfs
for clf in clfs:
#predict labels for the training set using the current tree classifier
pred_train = clf.predict(X_train)
values_train = recall_score(y_train, pred_train)
recall_train.append(values_train)
#same for test so we can plot both and pick a good alpha
recall_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = recall_score(y_test, pred_test)
recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend();
the best compromise between complexity and performance is around 0.0025 across the different graphs used to pick a best alpha
i chose 0.0025 as my alpha, but that tree was too busy, so i switched it to 0.005
#create the model where recall is highest
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
i had an issue where the alpha with the highest test recall gave an essentially unpruned tree (recall of 1), so i opted to choose my own alpha rather than accept too complicated a tree...
the selection automatically chose an alpha of 0 because i told it to pick the tree with the best recall on the test data... that's not really what i want, so i'm choosing an alpha myself based on the graphs, since i can't get the procedure to pick a reasonable best alpha.
best_alpha = 0.005
print("Best model ccp_alpha:", best_alpha)
estimator_2 = DecisionTreeClassifier(
ccp_alpha=best_alpha,
class_weight={0: 0.15, 1: 0.85},
random_state=1
)
estimator_2.fit(X_train, y_train)
confusion_matrix_sklearn(estimator_2, X_train, y_train)
dtree4_train_perf = model_performance_classification_sklearn(estimator_2, X_train, y_train)
dtree4_train_perf
i was worried that the perfect training performance of the post-pruned tree was a sign of some kind of error, but the discussion forum and ChatGPT both suggested it was not a concern.
confusion_matrix_sklearn(estimator_2, X_test, y_test)
dtree4_test_perf = model_performance_classification_sklearn(estimator_2, X_test, y_test)
dtree4_test_perf
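The {0: 0.15, 1: 0.85} weights used above were hand-picked to up-weight the rare positive class; scikit-learn can also derive weights directly from class frequencies. A sketch on made-up labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with 10% positives, mimicking the loan imbalance
y = np.array([1] * 10 + [0] * 90)

# "balanced" weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

print(dict(zip([0, 1], weights)))  # minority class gets the larger weight
```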
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator_2,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
this is more complicated than i would like in a perfect world, but it is probably as simple as the dataset allows without completely sacrificing our impurity measures
this model performs well overall, not just on recall, and allows us to develop sophisticated marketing strategies
models_train_comp_df = pd.concat(
[dtree1_train_perf.T, dtree2_train_perf.T, dtree3_train_perf.T, dtree4_train_perf.T], axis=1,
)
models_train_comp_df.columns = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning-Recall)", "Decision Tree (Pre-Pruning-F1)", "Decision Tree (Post-Pruning)"]
print("Training performance comparison:")
models_train_comp_df
models_test_comp_df = pd.concat(
[dtree1_test_perf.T, dtree2_test_perf.T, dtree3_test_perf.T, dtree4_test_perf.T], axis=1,
)
models_test_comp_df.columns = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning-Recall)", "Decision Tree (Pre-Pruning-F1)", "Decision Tree (Post-Pruning)"]
print("Test set performance comparison:")
models_test_comp_df
sns.pairplot(df, hue="Personal_Loan")
plt.show()
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator_2,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
if the marketing department wants something more sophisticated, they can look at the second pre-pruned tree for more in-depth analysis, but i'll be using the post-pruned tree because i think it's a good balance between complexity and simplicity, at least for this project.
there are a lot of nodes with low impurity. marketing is relatively harmless: if you send someone an ad they didn't want to see, it's not the end of the world, but it can be expensive. the right decision is to market to the people in the dark orange nodes and stay away from the blue ones.
there aren't many people in the white (mixed) leaf nodes, so domain expertise could guide targeted ads to them if we want to squeeze out marginal gains on the bottom line
there are two main reasons our depositors took on loans.
we can offer payday-style loans for the people who need quick cash.
or, for those making under $100k and spending almost $3k a month on credit cards, who may not all be paying their balances off, we could offer to roll credit card debt into a personal loan with low interest, but only if their card is with another bank! we could give these customers a low introductory rate.
or, for the families, a Christmas loan to pay off over the next year. maybe.